IMPROVEMENTS IN, OR RELATING TO, SPEECH-TO-SPEECH CONVERSION

The present invention relates to a system and method for 
speech-to-speech conversion and to a voice responsive 
communication system including a speech-to-speech conversion 
system.

Known speech recognition systems which are adapted to
provide spoken responses to speech inputs, include databases
which contain speech information in many different languages
and provide a recognition function for recognising and
interpreting information in the languages concerned. However,
the known speech recognition systems which may form part of
speech-to-speech conversion systems, or the like, are dedicated
to a single language, i.e. will only respond to speech inputs,
e.g. spoken enquiries/questions, in the particular language
which the system is adapted to handle and process.

In addition, the speech information data, which is stored
in a database and used for the formulation of appropriate
synthesised spoken responses to the speech inputs, is normally
reproduced in a dialect which conforms to a standard national
dialect. Thus, when there are significant differences between
the dialect of the speech inputs and the standard national
dialect, it may prove difficult, in certain circumstances, for
the database of known speech-to-speech conversion systems to
correctly interpret received speech information, i.e. the voice
inputs to the system. It may also be difficult for the person
making the voice inputs to fully understand the spoken
response. Even if such responses are understandable to a
recipient, it would be more user friendly if the dialect of the
spoken response is the same as the dialect of the related voice
input.

Also, with artificial reproduction of a spoken language, 
there is a need for the language to be reproduced naturally and 
with the correct accentuation. In particular, a word can have
widely different meanings depending on language stress. Also,
the meaning of one and the same sentence can depend on where 
the stress is placed. Furthermore, the stressing of sentences, 
or parts thereof, determines sections which are emphasised in 
the language and which may be of importance in determining the 
precise meaning of the spoken language. 

The need for artificially produced speech to be as natural 
as possible and have the correct accentuation is of particular 
importance in voice responsive communication devices and/or 
systems which produce speech in various contexts. With known 
voice responsive arrangements, the reproduced speech is 
sometimes difficult to understand and interpret. There is,
therefore, a need for a speech-to-speech conversion system in 
which the artificial speech outputs are natural, have the 
correct accentuation, and are readily understandable. 

With languages having well developed sentence accent,
stress and/or pitch in individual words, identification of the
natural meaning of the words/sentences is very difficult. The
fact that stresses can be incorrectly placed increases the risk
of misinterpretation, or that the meaning is completely lost
for the listening party.

Thus, in order to overcome these difficulties, it would 
be necessary for a speech-to-speech conversion system to be 
capable of interpreting the received speech information, 
irrespective of language and/or dialect, and to match the 
language and/or dialect of speech outputs to that of the 
respective speech inputs. Also, in order to be able to
determine the meaning of single words, or phrases, in an
unambiguous manner in a spoken sequence, it would be necessary
for the speech-to-speech conversion systems to be capable of
determining, and taking account of, sentence accent and
sentence stresses in the spoken sequence.

It is an object of the present invention to provide a
system and method for speech-to-speech conversion which are
adapted to recognize, interpret and process speech inputs in
at least two natural languages and provide speech outputs, i.e.
spoken responses, in the same language as the respective speech
inputs.

It is another object of the present invention to provide 
a system and method for speech-to-speech conversion which are 
adapted to recognize, interpret and process speech inputs in 
at least two natural languages and provide speech outputs, i.e. 
spoken responses, in the same language and dialect as the 
respective speech inputs, the matching of the dialect being 
effected using prosody information and, more particularly, the 
fundamental tone curve of the speech inputs. 

It is a further object of the present invention to provide
a voice responsive communication system, including a speech-to-
speech conversion system, operating in accordance with a
speech-to-speech conversion method.



The invention provides, in a voice responsive 
communication system, a method for providing a spoken response 
to a speech input, said method including the steps of 
recognising and interpreting the speech input, and utilising 
the interpretation to obtain speech information data from a 
database for use in the formulation of the spoken response, 
characterised in that the database contains speech information 
data in at least two natural languages, in that said method is 
adapted to recognise and interpret speech inputs in said at 
least two languages and to provide spoken responses to speech 
inputs in said languages, and in that said method includes the 
further steps of evaluating a recognised speech input to 
determine the language of the input, effecting a dialogue with 
the database to obtain speech information data for the 
formulation of a spoken response in the language of the speech 
input, and converting the speech information data, obtained
from the database, into said spoken response. 

In a preferred method, separate databases may be used for
each of said at least two languages, and dialogue may be
effected with only that one of said databases which contains
speech information data in the language of the input speech.
However, in the event that at least part of the required speech
information data for a spoken response is stored in another of
said databases, the method may include the further steps of
effecting a dialogue with said another database to obtain the
required speech information data, translating the information
data into the language of said one of the databases, combining
the speech information data from the databases, and converting
the combined speech information data into a spoken response in
the language of the speech input.

The speech recognition and interpretation of a speech 
input may be effected in at least two natural languages. In 
this case, recognised parts, or sequences, of the speech input, 
resulting from the speech recognition and interpretation in the 
said at least two natural languages, are evaluated to determine
the language of the speech input. The outcome of this 
evaluation process may be used to determine the database with 
which said dialogue is conducted to obtain the speech 
information data for a spoken response to the speech input. 

The dialogue with a database, and/or between databases,
may be effected using a database communication language, such
as SQL (Structured Query Language).

In a preferred method, according to the present invention,
the speech recognition and interpretation includes the steps
of extracting prosody information, i.e. the fundamental tone
curve, from a speech input, and obtaining dialect information
from said prosody information, said dialect information being
used in the conversion of said speech information data,
obtained from said database, into a spoken response, the spoken
responses being in the same language and dialect as the speech
input. This preferred method includes the further steps of
determining the intonation pattern of the fundamental tone and
thereby the maximum and minimum values of the fundamental tone
curve and their respective positions; determining the
intonation pattern of the fundamental tone curve of a speech
model and thereby the maximum and minimum values of the
fundamental tone curve and their respective positions;
comparing the intonation pattern of the input speech with the
intonation pattern of the speech model to identify a time
difference between the occurrence of the maximum and minimum
values of the fundamental tone curves of the incoming speech
in relation to the maximum and minimum values of the
fundamental tone curve of the speech model, the identified time
difference being indicative of dialectal characteristics of the
input speech. The time difference may be determined in
relation to an intonation pattern reference point, for example,
the point at which a consonant/vowel limit occurs.
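
The comparison of intonation patterns described above can be sketched as follows. This is a minimal illustration only, assuming that each fundamental tone curve is available as a list of (time, frequency) samples and that the reference point (for example, a consonant/vowel limit) has already been located. The names `extrema` and `timing_offsets` are hypothetical and do not appear in the specification.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (time in seconds, fundamental tone in Hz)


def extrema(curve: Curve) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Return ((t_max, f_max), (t_min, f_min)) for a fundamental tone curve."""
    t_max, f_max = max(curve, key=lambda point: point[1])
    t_min, f_min = min(curve, key=lambda point: point[1])
    return (t_max, f_max), (t_min, f_min)


def timing_offsets(speech: Curve, model: Curve,
                   speech_ref: float, model_ref: float) -> Tuple[float, float]:
    """Time difference of the maximum and of the minimum, input versus model.

    Each position is first measured relative to its own reference point,
    so that the two curves are compared on a common time base.
    """
    (s_tmax, _), (s_tmin, _) = extrema(speech)
    (m_tmax, _), (m_tmin, _) = extrema(model)
    d_max = (s_tmax - speech_ref) - (m_tmax - model_ref)
    d_min = (s_tmin - speech_ref) - (m_tmin - model_ref)
    return d_max, d_min
```

A positive offset means that the extremum occurs later in the input speech than in the speech model. How such offsets map to a particular dialect is not specified here and would have to be established separately.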



The method, according to the present invention, may
include the step of obtaining information on sentence accents
from said prosody information.

The words in the speech model may be checked lexically,
and the phrases in the speech model may be checked
syntactically. The words and phrases which are not
linguistically possible are excluded from the speech model.
In addition, the orthography and phonetic transcription of the
words in the speech model may be checked, the transcription
information including lexically abstracted accent information,
of type stressed syllables, and information relating to the
location of secondary accent. The accent information may
relate to tonal word accent I and accent II.

In addition, the method, according to the present
invention, may use sentence accent information in the
interpretation of the input speech.

The invention also provides a speech-to-speech conversion 
system for providing, at the output thereof, spoken responses 
to speech inputs in at least two natural languages, including 
speech recognition means for the speech inputs; interpretation 
means for interpreting the content of the recognised speech 
inputs, and a database containing speech information data for
use in the formulation of said spoken responses, characterised 
in that the speech information data stored in the database is 
in the said at least two natural languages, in that the speech 
recognition and interpretation means are adapted to recognise 
and interpret speech inputs in said at least two natural 
languages, and in that the system further includes evaluation 
means for evaluating the recognised speech inputs and 
determining the language of the inputs, dialogue management
means for effecting a dialogue with the database to obtain said 
speech information data in the language of a speech input, and 
text-to-speech conversion means for converting the speech 
information data, obtained from the database, into a spoken 
response. 

The speech-to-speech conversion system, according to the 
present invention, which is adapted to receive speech inputs 
in two, or more, natural languages and to provide, at the 
output thereof, spoken responses in the language of the 
respective speech inputs, preferably includes, for each of the 
natural languages, speech recognition means, the inputs of each 
of the speech recognition means being connected to a common 
input for the system; speech evaluation means for determining, 
in dependence on the output of each of the speech recognition 
means, the language of a speech input; a database containing 
speech information data for use in the formulation of spoken 
responses in the language of the database; dialogue management 
means for connection to a respective speech recognition means,
in dependence on the language of the speech input, said
management means being adapted to interpret the content of
recognised speech and, on the basis of the interpretation, to
access, and obtain speech information data from, at least a
respective one of the databases; and text-to-speech conversion 
means for converting the speech information data obtained by 
said management means into spoken responses to respective 
speech inputs. 

The speech-to-speech conversion system may include
separate databases for each of said at least two languages,
and separate dialogue management means for each of the 
databases, each dialogue management means being adapted to 
effect a dialogue with, at least, a respective one of the 
databases. Also, each dialogue management means may be adapted 
to effect a dialogue with each of the databases. In this case, 
the system includes translation means for translating the 
speech information data output of each of the databases into 
the language of the other databases. 






In the event that at least part of the required speech 
information data for a spoken response is stored in a database 
in a different language to that which is required for the 
spoken response, the speech information data may be obtained 
from said database and translated by said translation means 
into the required language for the spoken response. The 
translated speech information is then used either alone, or in 
combination with other speech information, by the dialogue 
management means to provide an output for application to the 
text-to-speech conversion means. 

The speech-to-speech conversion system is preferably 
adapted to receive speech inputs in two languages, in which 
case, the system includes, for each of the two languages, a 
database, dialogue management means and translation means, in
that each of the dialogue management means is adapted to
communicate with each of the databases, the data output of each
of the databases being connected directly to one of the
dialogue management means and the other of the management means
via a translation means.

WO 97/43707 PCT/SE97/00584

The speech-to-speech conversion system preferably includes
speech recognition and interpretation means for each of the
said at least two natural languages, the inputs to the speech
recognition and interpretation means being connected to a
common system input. The recognised parts, or sequences, of
the speech input, resulting from said speech recognition and
interpretation in the said at least two natural languages, are
evaluated by the evaluation means to determine the language of
the speech input. The evaluation means may be used, in a
preferred system, to select the database from which said speech
information data will be obtained by said dialogue management
means for the formulation of the spoken response to the speech
input.

The speech recognition and interpretation means may
include extraction means for extracting prosody information
from the speech input, and means for obtaining dialectal
information from said prosody information, said dialectal
information being used by said text-to-speech conversion means
in the conversion of said speech information data into the
spoken response, the dialect of the spoken response being
matched to that of the speech input. The prosody information
extracted from the speech input is the fundamental tone curve of
the speech input.

The means for obtaining dialectal information from said
prosody information may include first analysing means for
determining the intonation pattern of the fundamental tone of
the input speech and thereby the maximum and minimum values of
the fundamental tone curve and their respective positions;
second analysing means for determining the intonation pattern
of the fundamental tone curve of the speech model and thereby
the maximum and minimum values of the fundamental tone curve
and their respective positions; comparison means for comparing
the intonation pattern of the input speech with the intonation
pattern of the speech model to identify a time difference
between the occurrence of the maximum and minimum values of the
fundamental tone curves of the incoming speech in relation to
the maximum and minimum values of the fundamental tone curve
of the speech model, the identified time difference being
indicative of dialectal characteristics of the input speech.
The time difference may be determined in relation to an
intonation pattern reference point, i.e. the point at which a
consonant/vowel limit occurs.

The speech-to-speech conversion system may also include
means for obtaining information on sentence accents from said
prosody information.

The speech recognition means may include checking means
for lexically checking the words in the speech model and for
syntactically checking the phrases in the speech model, the
words and phrases which are not linguistically possible being
excluded from the speech model. The checking means may be
adapted to check the orthography and phonetic transcription of
the words in the speech model, in which case, the transcription
information includes lexically abstracted accent information,
of type stressed syllables, and information relating to the
location of secondary accent. The accent information may
relate to tonal word accent I and accent II.

The sentence accent information may be used in the 
interpretation of the content of the recognised input speech.

The sentence stresses may be determined and used in the
interpretation of the content of the recognised input speech.

The invention further provides a voice responsive
communication system which includes a speech-to-speech
conversion system as outlined in the preceding paragraphs, or
utilises a method as outlined in the preceding paragraphs for
providing a spoken response to a speech input to the system.

The foregoing and other features of the present invention 
will be better understood from the following description, with 
reference to the single figure of the accompanying drawings, 
which illustrates, in the form of a block diagram, a speech-to- 
speech conversion system, according to the present invention. 

The speech-to-speech conversion system, according to the 
present invention, is adapted to provide, at the output 
thereof, spoken responses to speech inputs in at least two 
natural languages. The language characteristics of the spoken 
responses, for example, dialect, sentence accent and sentence 
stresses, are matched, by the present invention, to those of 
the input speech to provide natural speech outputs which can 
be readily understood, have the correct accentuation and give 
rise to a user friendly system. It will be seen from the 
following description that the matching of the language 
characteristics is achieved by extracting prosody information 
from the speech input, i.e. the fundamental tone curve of the 
speech input, and using the prosody information to determine 
dialectal, sentence accent and sentence stressing, information 
for use in the formulation of the spoken responses. 

The speech-to-speech conversion system may, therefore, be 
used in many applications, for example, in voice responsive 
communication systems to effect a dialogue between a user of 
the system and a database which forms part of the system's 
speech recognition unit and which contains speech information 
data for the formulation of spoken responses to spoken 
questions/enquiries from users of the system. Such voice 
responsive communication systems could be used in 
telecommunications, or banking, or security, etc., to provide 
a readily understandable, user friendly, system.

The speech-to-speech conversion system, illustrated in the 
single figure of the accompanying drawings, is adapted to 
provide, at the output thereof, spoken responses to speech 
inputs in two natural languages, i.e. languages A and B, which 
may be any natural language, for example Swedish and English. 

As shown in the accompanying drawing, the system includes 
speech recognition and interpretation units 1 and 2, 
respectively, for the languages A and B. The inputs of the 
units 1 and 2 are connected to a common input for the system. 
The speech recognition and interpretation units 1 and 2 are 
used to recognise and interpret the content of the speech input 
in a manner to be subsequently outlined.

An output of each of the units 1 and 2 is connected to
separate inputs of an evaluation unit 3 which is adapted to
evaluate the recognised speech inputs and determine the
language of the inputs, i.e. language A, or language B.

The system of the present invention also includes two 
switching units 4 and 5, the inputs of which are respectively 
connected to an output of the speech recognition and 
interpretation units 1 and 2. Operation of the switching units
4 and 5 is controlled, in a manner to be subsequently outlined,
by the evaluation unit 3, i.e. the control inputs to the units
4 and 5 are respectively connected to separate outputs of the
evaluation unit 3.

The outputs of the switching units 4 and 5 are 
respectively connected to an input of dialogue management units 
6 and 7. It will be seen from subsequent description that the
dialogue management units 6 and 7 are used to effect a dialogue
with database units 8 and 9 to obtain speech information data,
in the language of a speech input, for use in the formulation
of the spoken responses.

A lexicon and syntax unit 10 for the language A is 
connected to another output of the speech recognition and 
interpretation unit 1, to the dialogue management unit 6 and 
to an input of a text-to-speech conversion unit 12.

A lexicon and syntax unit 11 for the language B is 
connected to another output of the speech recognition and 
interpretation unit 2, to the dialogue management unit 7 and 
to an input of a text-to-speech conversion unit 13. 

The text-to-speech conversion units 12 and 13 are also 
respectively connected, at another input thereof, to an output 
of dialogue management units 6 and 7.

The outputs of the text-to-speech conversion units 12 and 
13 are connected to a common speech output for the system. 

As shown in the accompanying drawing, there is a two way
communication path between the dialogue management unit 6 and
database unit 8, and between the dialogue management unit 7 and
database unit 9. These communication paths are used to effect,
in a manner to be subsequently outlined, a dialogue between the
respective management and database units to obtain speech
information data for use in the formulation of the spoken
responses. The two way communication paths are interconnected
to enable a dialogue to be undertaken between management unit
6 and database unit 9 and/or between management unit 7 and
database unit 8. In practice, the dialogue with a database
unit, and/or between database units, is effected using a
database communication language, such as SQL (Structured Query
Language).
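
By way of illustration only, a dialogue of this kind between a management unit and a database unit could take the following form. The sketch assumes an in-memory SQLite database; the table name `responses`, its columns and the function `query_response` are hypothetical, since the specification does not define a schema.

```python
import sqlite3

# Hypothetical database unit: one table mapping an enquiry topic to response text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (topic TEXT PRIMARY KEY, text TEXT)")
conn.execute(
    "INSERT INTO responses VALUES (?, ?)",
    ("opening_hours", "We are open from nine to five."),
)


def query_response(connection, topic):
    """Return the stored response text for a topic, or None if it is absent."""
    row = connection.execute(
        "SELECT text FROM responses WHERE topic = ?", (topic,)
    ).fetchone()
    return row[0] if row else None
```

A result of `None` corresponds to the case in which the required speech information data is not held in the database unit that was queried.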

A translation unit 14 is provided for translating language 
A into language B and vice versa. It will be seen from the 
accompanying drawing that one section 14a of the translation
unit 14 has an input for language B which is connected to an
output of database unit 9, and an output for language A which
is connected to an input of dialogue management unit 6.
Another section 14b of the translation unit 14 has an input for
language A which is connected to an output of database unit 8,
and an output for language B which is connected to an input of
dialogue management unit 7.

The manner in which the speech-to-speech conversion system
is adapted to receive speech inputs in natural languages A and
B, and to provide, at the output thereof, spoken responses in 
the language of the respective speech input, is outlined in the 
following paragraphs. 

A speech input to the speech-to-speech conversion system
which can be in either language A, or language B, is recognised
and interpreted by each of the speech recognition and
interpretation units 1 and 2, in association with the
respective lexicon and syntax units 10 and 11, i.e. using
statistically-based speech recognition and language modelling
techniques, and ensuring that the recognised words and/or word
combinations which are used to form a model of the speech
input, are acceptable both lexically and syntactically. The
purpose of the lexicon/syntax checks is to identify and exclude
any words from the speech model which do not exist in the
language concerned, and/or any phrase whose syntax does not
correspond with the language concerned.
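
The lexicon/syntax checks can be pictured with a deliberately small sketch. It assumes a toy lexicon that maps each word to a part of speech, and a set of permitted part-of-speech patterns standing in for the syntax rules. `LEXICON`, `ALLOWED_PATTERNS` and the function names are hypothetical; a practical system would rely on a full lexicon and grammar.

```python
# Hypothetical toy lexicon for one language: word -> part of speech.
LEXICON = {"what": "PRON", "is": "VERB", "the": "DET", "time": "NOUN"}

# Hypothetical syntax rules: permitted part-of-speech sequences.
ALLOWED_PATTERNS = {("PRON", "VERB", "DET", "NOUN")}


def lexically_valid(words):
    """True when every word exists in the lexicon of the language."""
    return all(word in LEXICON for word in words)


def syntactically_valid(words):
    """True when the part-of-speech sequence matches a permitted pattern."""
    return tuple(LEXICON[word] for word in words) in ALLOWED_PATTERNS


def filter_hypotheses(hypotheses):
    """Keep only word sequences that pass both the lexical and syntactic checks."""
    return [h for h in hypotheses if lexically_valid(h) and syntactically_valid(h)]
```

The lexical check runs first, so a word sequence containing an unknown word is excluded before its syntax is examined.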

The speech models respectively created by the units 1 and
10, and the units 2 and 11, are applied to, and evaluated by,
the evaluation unit 3 which determines which of the languages
A and B is most probable for the speech input. This evaluation
is effected on the basis of probability, i.e. the probability
that the speech input is one, or other, of the languages A and
B, the differences between the speech models, and whether the
language modelling for one, or other, of the languages has been
successfully completed. The greater the difference between the
language characteristics of languages A and B, the easier will
be the task of the evaluation unit 3.
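
One way to picture this evaluation is given below. The sketch assumes that each recognition path reports a log-probability score for its best speech model together with a flag showing whether language modelling completed. The `RecognitionResult` type and both functions are hypothetical, as the specification does not prescribe a scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    language: str           # "A" or "B"
    log_probability: float  # assumed score of the best speech model
    completed: bool         # whether language modelling succeeded


def select_language(results):
    """Return the most probable language among completed results, or None."""
    candidates = [r for r in results if r.completed]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.log_probability).language


def score_margin(results):
    """Difference between the two best completed scores, or None if fewer than two."""
    scores = sorted((r.log_probability for r in results if r.completed), reverse=True)
    return scores[0] - scores[1] if len(scores) >= 2 else None
```

A large margin corresponds to the easy case mentioned above, in which the language characteristics of the two languages differ widely.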

Depending on the outcome of the evaluation by the unit 3,
i.e. the selected language of the speech input, one of the
switching units 4 and 5 will be activated to connect the speech 
recognition and interpretation unit for the selected language 
to the corresponding dialogue management unit. 

If it is assumed, for the purpose of this description,
that language A has been selected as the most probable language
for the speech input, then switching unit 4 will be activated
and the output of speech recognition and interpretation unit
1 will be connected to an input of dialogue management unit 6.
Thus, the switching unit 5 will remain in a deactivated state
and no connection will, therefore, be made between the dialogue
management unit 7 and the speech recognition and interpretation
unit 2.
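
The switching behaviour can be summarised in a few lines. The sketch uses the unit numbers of the drawing as dictionary keys and return values; the functions `switch_states` and `route` are hypothetical names introduced for illustration.

```python
def switch_states(selected_language):
    """Activation state of switching units 4 and 5 for the selected language."""
    return {4: selected_language == "A", 5: selected_language == "B"}


def route(selected_language, model_a, model_b):
    """Return (dialogue management unit number, speech model) for the active path."""
    states = switch_states(selected_language)
    if states[4]:
        return 6, model_a  # unit 1 connected to dialogue management unit 6
    if states[5]:
        return 7, model_b  # unit 2 connected to dialogue management unit 7
    raise ValueError("no language selected")
```

Only one switching unit is active at a time, so the speech model of the language that was not selected never reaches a dialogue management unit.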

In the next stage of the speech-to-speech conversion
process, the management unit 6 enters into a linguistic
dialogue with the database unit 8, on the basis of the speech
model of the speech input, to obtain speech information data
for the formulation of a spoken response to the speech input.
The speech information data, selected as a result of this
dialogue, is transferred via the management unit 6 to an input
of the text-to-speech conversion unit 12 for the formulation of
a spoken response. It will be seen from subsequent description
that the language characteristics of the spoken response are
matched, as far as possible, to the language characteristics
of the speech input.

In the event that at least part of the required speech
information data for a spoken response is not stored in the
database unit 8, but may be stored in the database unit 9, the
dialogue management unit 6 enters into a dialogue with the
database unit 9 to obtain the required speech information data.
If the required speech information data is stored in database
unit 9, it is accessed and transferred to the dialogue
management unit 6 via section 14a of the translation unit 14,
i.e. is translated from language B into language A. The
translated speech information data is then used either alone,
or in combination with speech information data obtained from
the database unit 8, to formulate the spoken response, i.e.
converted by the text-to-speech conversion unit 12 into the
spoken response.
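
The fallback to the second database can be illustrated as follows. The sketch represents each database unit as a dictionary and stands in for section 14a of the translation unit with a fixed lookup table. Language A is taken to be English and language B Swedish purely for the sake of the example, and all names are hypothetical.

```python
def translate_b_to_a(text):
    """Stand-in for translation section 14a (hypothetical fixed lookup table)."""
    table = {"Banken öppnar klockan nio.": "The bank opens at nine."}
    return table[text]


def gather_response_data(topics, db_a, db_b):
    """Collect response text in language A, using db_b only when db_a lacks a topic."""
    parts = []
    for topic in topics:
        if topic in db_a:
            parts.append(db_a[topic])                    # direct path from database unit 8
        elif topic in db_b:
            parts.append(translate_b_to_a(db_b[topic]))  # path via translation section 14a
    return " ".join(parts)
```

The combined text is what would then be passed to the text-to-speech conversion unit for language A.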

Clearly, if language B, rather than language A, is
selected by the evaluation unit 3 as the language of the speech
input, then the units 7, 9 and 13 would be used, in the same
manner as outlined above for the units 6, 8 and 12, for the
formulation of the spoken response. Any information that may
be required from the database unit 8 would be accessed by, and
transferred to, the dialogue management unit 7, the translation
of the transferred information data being effected by section
14b of the translation unit 14.

The recognition and interpretation of speech can give rise
to technical problems and if these problems are not overcome,
then difficulties will be experienced in obtaining a correct
and meaningful interpretation of the speech inputs. In
particular, if the recognition and interpretation of the speech
inputs is incorrect, then it will be extremely difficult for
the evaluation unit 3 to determine the language of the speech
inputs and it will not, therefore, be possible to provide
proper spoken responses to the speech inputs.

Thus, in accordance with the present invention, these 
problems are overcome by extracting prosody information from 
the speech inputs and using this information to determine, in 
a manner to be subsequently outlined, dialectal, sentence
accent, and sentence stressing, information for use in the
recognition and interpretation process and in the formulation 
of the spoken responses. 

The extraction of the prosody information, i.e. the
fundamental tone curve, from the speech input is effected by
prosody extraction means (not illustrated) which form part of
the speech recognition and interpretation units 1 and 2. These
units also include means (not illustrated) for obtaining
dialectal information from the prosody information.

Thus, with the present invention, the speech recognition
and interpretation units 1 and 2 are adapted to operate, in a
manner well known to persons skilled in the art, to recognise
and interpret the speech inputs to the system. The speech
recognition and interpretation units 1 and 2 may, for example,
operate by using a Hidden Markov model, or an equivalent speech
model. In essence, the function of the units 1 and 2 is to
convert speech inputs to the system into a form, which is a
faithful representation of the content of the speech inputs,
and which is suitable for evaluation by the evaluation unit 3
and use by the dialogue management units 6 and 7. In other
words, the content of the textual information data, at the
output of each of the speech recognition and interpretation
units 1 and 2, must be:

- an accurate representation of the speech input; and

- usable by the dialogue management units 6 and 7
  to respectively access, and extract speech
  information data from, the database units 8 and 9,
  for use in the formulation of a synthesised spoken
  response, i.e. by a respective one of the text-to-
  speech conversion units 12 and 13.



In practice, the recognition and interpretation process
would, in essence, be effected by identifying a number of
phonemes from a segment of the speech input which are combined
into allophone strings, the phonemes being interpreted as
possible words, or word combinations, to establish a model of
the speech. The established speech model will have word and
sentence accents according to a standardised pattern for the
language of the input speech.
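
The step of interpreting phonemes as possible words, or word combinations, can be sketched in a much simplified form. The example assumes a small pronunciation lexicon keyed by phoneme tuples and enumerates every segmentation of a phoneme string into lexicon words. It ignores allophones and accents, and `PRONUNCIATIONS` and `word_sequences` are hypothetical names.

```python
# Hypothetical pronunciation lexicon: phoneme tuple -> word.
PRONUNCIATIONS = {
    ("h", "e", "l", "o"): "hello",
    ("h", "e"): "he",
    ("l", "o"): "low",
}


def word_sequences(phonemes):
    """Return every way of segmenting the phoneme list into lexicon words."""
    if not phonemes:
        return [[]]  # one valid segmentation of nothing: the empty sequence
    results = []
    for end in range(1, len(phonemes) + 1):
        word = PRONUNCIATIONS.get(tuple(phonemes[:end]))
        if word is not None:
            for rest in word_sequences(phonemes[end:]):
                results.append([word] + rest)
    return results
```

Each returned word sequence is a candidate for the speech model, to be accepted or rejected by the subsequent lexical and syntactic checks.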

The information, concerning the recognised words and word
combinations, generated by the speech recognition and
interpretation units 1 and 2, is checked, in a manner as
outlined above, both lexically and syntactically. In practice,
this would be effected using a lexicon with orthography and
transcription.

Thus, in accordance with the present invention, the speech
recognition and interpretation units 1 and 2 ensure that only
those words, and word combinations, which are found to be
acceptable both lexically and syntactically, are used to create
a model of the input speech. In practice, the intonation
pattern of the speech model is a standardised intonation
pattern for the language concerned, or an intonation pattern
which has been established by training, or explicit knowledge,
using a number of dialects of the language concerned.
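
The lexical and syntactic filtering described above can be illustrated with a minimal sketch. It is not the specification's implementation: the lexicon entries, the transcription strings, the word classes and the table of permitted word-class pairs are all hypothetical, chosen only to show how candidates that fail either check are kept out of the speech model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon: orthography -> (phonetic transcription, word class).
LEXICON: Dict[str, Tuple[str, str]] = {
    "the": ("D @", "DET"),
    "train": ("t r eI n", "NOUN"),
    "leaves": ("l i: v z", "VERB"),
}

# Hypothetical syntax: word-class pairs that may stand next to each other.
ALLOWED_PAIRS: Set[Tuple[str, str]] = {
    ("DET", "NOUN"),
    ("NOUN", "VERB"),
    ("VERB", "DET"),
}


def lexically_ok(words: List[str]) -> bool:
    """Every word must have an entry in the lexicon."""
    return all(word in LEXICON for word in words)


def syntactically_ok(words: List[str]) -> bool:
    """Every adjacent pair of word classes must be permitted."""
    classes = [LEXICON[word][1] for word in words]
    return all(pair in ALLOWED_PAIRS for pair in zip(classes, classes[1:]))


def acceptable_candidates(candidates: List[List[str]]) -> List[List[str]]:
    """Keep only candidates that pass both checks (lexical check first)."""
    return [c for c in candidates if lexically_ok(c) and syntactically_ok(c)]


candidates = [
    ["the", "train", "leaves"],   # passes both checks
    ["the", "trane", "leaves"],   # "trane" is not in the lexicon
    ["leaves", "the", "the"],     # DET followed by DET is not permitted
]
print(acceptable_candidates(candidates))
```

A real system would use a full lexicon and grammar; the pair table here merely stands in for the syntactic check.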

As stated above, the prosody information, i.e. the
fundamental tone curve, extracted from the input speech by the
extraction unit 3, can be used to obtain dialectal, sentence
accent and sentence stressing, information, for use by the
speech-to-speech conversion system and method of the present
invention. In particular, the dialectal information can be
used by the speech-to-speech conversion system and method to
match the dialect of the output speech to that of the input
speech and the sentence accent and stressing information can
be used in the recognition and interpretation of the input
speech.

In accordance with the present invention, the means for
obtaining dialectal information from the prosody information
includes:

- first analysing means for determining the intonation
  pattern of the fundamental tone of the input speech
  and thereby the maximum and minimum values of the
  fundamental tone curve and their respective
  positions;

- second analysing means for determining the
  intonation pattern of the fundamental tone curve of
  the speech model and thereby the maximum and minimum
  values of the fundamental tone curve and their
  respective positions; and

- comparison means for comparing the intonation
  pattern of the input speech with the intonation
  pattern of the speech model to identify a time
  difference between the occurrence of the maximum and
  minimum values of the fundamental tone curves of the
  incoming speech in relation to the maximum and
  minimum values of the fundamental tone curve of the
  speech model, the identified time difference being
  indicative of the dialectal characteristics of the
  input speech.

The time difference, referred to above, may be determined 
in relation to an intonation pattern reference point. 

In the Swedish language, the difference, in terms of
intonation pattern, between different dialects can be described
by different points in time for word and sentence accent, i.e.
the time difference can be determined in relation to an
intonation pattern reference point, for example, the point at
which a consonant/vowel limit occurs.

Thus, in a preferred arrangement for the present
invention, the reference against which the time difference is
measured is the point at which the consonant/vowel boundary,
i.e. the CV-boundary, occurs.
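
The comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the specification's implementation: each fundamental tone curve is assumed to be a list of (time in milliseconds, frequency in Hz) samples for one accented word, the position of the CV-boundary is assumed to be known for both the input speech and the speech model, and the function names are hypothetical.

```python
from typing import List, Tuple

Contour = List[Tuple[int, float]]  # (time in ms, fundamental tone in Hz)


def extrema_offsets(contour: Contour, cv_boundary_ms: int) -> Tuple[int, int]:
    """Times of the F0 maximum and minimum, measured from the CV-boundary."""
    t_max = max(contour, key=lambda sample: sample[1])[0]
    t_min = min(contour, key=lambda sample: sample[1])[0]
    return t_max - cv_boundary_ms, t_min - cv_boundary_ms


def dialect_time_difference(
    input_contour: Contour,
    input_cv_ms: int,
    model_contour: Contour,
    model_cv_ms: int,
) -> Tuple[int, int]:
    """Input-minus-model timing of the maximum and of the minimum."""
    in_max, in_min = extrema_offsets(input_contour, input_cv_ms)
    mo_max, mo_min = extrema_offsets(model_contour, model_cv_ms)
    return in_max - mo_max, in_min - mo_min


# Made-up sample values, for illustration only.
model = [(0, 110.0), (50, 120.0), (100, 150.0), (150, 130.0), (200, 100.0)]
spoken = [(0, 105.0), (50, 115.0), (100, 125.0), (150, 155.0),
          (200, 120.0), (250, 95.0)]
print(dialect_time_difference(spoken, 60, model, 50))
```

A positive value means the turning point occurs later, relative to the CV-boundary, in the input speech than in the speech model.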

The identified time difference which, as stated above, is 
indicative of the dialect in the input speech, i.e. the spoken 
language, is applied to the text-to-speech conversion units 12 
and 13 to enable the intonation pattern, and thereby the 
dialect, of the speech output of the system to be corrected so 
that it corresponds to the intonation pattern of the 
corresponding words and/or phrases of the input speech. Thus,
this corrective process enables the dialectal information in 
the input speech to be incorporated into the output speech. 
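
One way to picture this corrective step is as a shift of the turning points of the standard output intonation pattern. The sketch below is an assumption-laden illustration, not the behaviour of the text-to-speech conversion units 12 and 13: the `TonalTarget` type and the `apply_dialect_shift` function are hypothetical, and the output contour is assumed to be specified as maxima and minima placed at offsets from the CV-boundary.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class TonalTarget:
    offset_ms: int   # time of the turning point, relative to the CV-boundary
    f0_hz: float     # fundamental tone value at the turning point
    kind: str        # "max" or "min"


def apply_dialect_shift(
    targets: List[TonalTarget], max_shift_ms: int, min_shift_ms: int
) -> List[TonalTarget]:
    """Move each maximum/minimum by the time difference found in the input."""
    shifted = []
    for target in targets:
        shift = max_shift_ms if target.kind == "max" else min_shift_ms
        shifted.append(replace(target, offset_ms=target.offset_ms + shift))
    return shifted


# Standard pattern (made-up values) and a dialectal shift of +40 ms / +30 ms.
standard = [TonalTarget(50, 150.0, "max"), TonalTarget(150, 100.0, "min")]
for target in apply_dialect_shift(standard, 40, 30):
    print(target)
```

Only the timing is changed in this sketch; the tone values themselves are left as in the standard pattern.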

As stated above, the fundamental tone curve of the speech 
model is based on information resulting from the lexical 
(orthography and transcription) and syntactic checks. In 
addition, the transcription information includes lexically 
abstracted accent information, of type stressed syllables, i.e. 
tonal word accents I and II, and information relating to the 
location of secondary accents, i.e. information given, for 
instance, in dictionaries. This information can be used to 
adjust the recognition pattern of the speech recognition model, 
for example, the Hidden Markov model, to take account of the 
transcription information. A more exact model of the input
speech is, therefore, obtained during the interpretation
process.

A further consequence of this speech model corrective 
process is that, through time, the speech model will have an 
intonation pattern which has been established by a training
process.


WO 97/43707 PCT/SE97/00584

Also, with the system and method of the present invention,
the speech model is compared with a spoken input sequence, and
any difference therebetween can be determined and used to
bring the speech model into conformity with the spoken sequence
and/or to determine stresses in the spoken sequence.

In addition, the identification of the stresses in a
spoken sequence makes it possible to determine the precise
meaning of the spoken sequence in an unambiguous manner. In
particular, relative sentence stresses can be determined by
classifying the ratio between variations and declination of the
fundamental tone curve, whereby emphasised sections, or
individual words can be determined. In addition, the pitch of
the speech can be determined from the declination of the
fundamental tone curve.

Thus, in order to take account of sentence stresses in the
recognition and interpretation of the speech inputs to the
speech-to-speech conversion system of the present invention,
the prosody extraction means and the associated speech
recognition and interpretation unit, for each of the languages
A and B, are adapted to:

- determine a first ratio between the variation and
  declination of the fundamental tone curve of the
  input speech;

- determine a second ratio between the variation and
  declination of the fundamental tone curve of the
  speech model; and

- compare the first and second ratios, any
  identified difference being used to determine
  sentence accent placements.
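
The ratio comparison can be sketched as below. The specification does not define "variation" and "declination" numerically, so the definitions here are assumptions made for illustration: declination is taken as the total fall of a least-squares straight line fitted to the whole curve, variation as the spread of the curve around that line within one segment, and a segment is flagged when the input ratio exceeds the model ratio by a hypothetical factor. Both curves are assumed to be sampled at the same times.

```python
from typing import List, Sequence, Tuple


def declination_line(times: Sequence[float], f0: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line (slope, intercept) through the curve."""
    n = len(times)
    mean_t = sum(times) / n
    mean_f = sum(f0) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0))
    slope = sxy / sxx
    return slope, mean_f - slope * mean_t


def variation_declination_ratio(
    times: Sequence[float], f0: Sequence[float], start: float, end: float
) -> float:
    """Spread around the declination line in [start, end) over the total fall."""
    slope, intercept = declination_line(times, f0)
    residuals = [f - (slope * t + intercept)
                 for t, f in zip(times, f0) if start <= t < end]
    total_fall = abs(slope) * (times[-1] - times[0])
    if not residuals or total_fall == 0:
        raise ValueError("segment is empty or the curve has no declination")
    return (max(residuals) - min(residuals)) / total_fall


def stressed_segments(
    times: Sequence[float],
    input_f0: Sequence[float],
    model_f0: Sequence[float],
    segments: Sequence[Tuple[float, float]],
    factor: float = 2.0,  # hypothetical decision threshold
) -> List[int]:
    """Indices of segments whose input ratio clearly exceeds the model ratio."""
    flagged = []
    for index, (start, end) in enumerate(segments):
        first = variation_declination_ratio(times, input_f0, start, end)
        second = variation_declination_ratio(times, model_f0, start, end)
        if first > factor * second:
            flagged.append(index)
    return flagged


# Made-up curves: the input has a raised peak in the third segment.
times = [0, 1, 2, 3, 4, 5, 6, 7]
model_f0 = [202, 188, 182, 168, 162, 148, 142, 128]
input_f0 = [202, 188, 182, 168, 192, 148, 142, 128]
segments = [(0, 2), (2, 4), (4, 6), (6, 8)]
print(stressed_segments(times, input_f0, model_f0, segments))
```

Other definitions of the two quantities, or a trained classifier in place of the fixed factor, would fit the same outline.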

Furthermore, classification of the ratio between the
variation and declination of the fundamental tone curve makes
it possible to identify/determine relative sentence stresses,
and emphasised sections, or words.

The relation between [...] dynamic range [...].

The information obtained in respect of the fundamental
tone curve concerning [...] can be used for the [...] speech
inputs by the units 1 and 2, i.e. [...], in the manner
outlined above, to obtain [...] understanding of the content
[...] pattern of the [...] with the input speech.

[...] characteristics ([...] and stressing) of the input
speech can be used to give an increased understanding of the
input speech [...] increase the probability that the [...]
language of the [...]. [...] corrected speech models can also
be used by the database management units 6 and 7 to obtain
required speech information data [...] database units 8 and 9
for the formulation [...] a voice input to the
speech-to-speech conversion system.

The ability to readily interpret [...] language using
fundamental [...] significance because [...] can be effected
without having to [...] recognition system. The result of this
is that the [...], and thereby cost, of a speech recognition
system, [...] the present invention can [...] would be
possible with known systems. These are, [...] distinct
advantages over known speech recognition systems.

The system is, therefore, adapted to recognise and
accurately interpret the content of speech inputs in two, or
more, natural languages and to match the characteristics,
e.g. dialect, of the voice responses to those of the voice
inputs. This process provides a user friendly system because
the language of the man-machine dialogue is in accordance with
the dialect of the user concerned.

The present invention is not limited to the embodiments
outlined above, but can be modified within the scope of the
appended patent claims and the inventive concept.

CLAIMS

1. In a voice responsive communication system, a method for
providing a spoken response to a speech input, said method
including the steps of recognising and interpreting the speech
input, and utilising the interpretation to obtain speech
information data from a database for use in the formulation of
the spoken response, characterised in that the database
contains speech information data in at least two natural
languages, in that said method is adapted to recognise and
interpret speech inputs in said at least two languages and to
provide spoken responses to speech inputs in said languages,
and in that said method includes the further steps of
evaluating a recognised speech input to determine the language
of the input, effecting a dialogue with the database to obtain
speech information data for the formulation of a spoken
response in the language of the speech input, and converting
the speech information data, obtained from the database, into
said spoken response.

2. A method as claimed in claim 1, characterised in that
separate databases are used for each of said at least two
languages.

3. A method as claimed in claim 2, characterised in that said
dialogue is effected with only that one of said databases which
contains speech information data in the language of the input
speech.

4. A method as claimed in claim 2, characterised in that said
dialogue is effected with that one of said databases which
contains speech information in the language of the input
speech, and in that, in the event that at least part of the
required speech information data for a spoken response is
stored in another of said databases, said method includes the
further steps of effecting a dialogue with said another of the
databases to obtain the required speech information data,
translating the information data into the language of said one
of the databases, combining the speech information data from
the databases, and converting the combined speech information
data into a spoken response in the language of the speech
input.

5. A method as claimed in any one of the preceding claims,
characterised in that speech recognition and interpretation of
a speech input is effected in at least two natural languages.

6. A method as claimed in claim 5, characterised in that
recognised parts, or sequences, of the speech input, resulting
from said speech recognition and interpretation in the said at
least two natural languages, are evaluated to determine the
language of the speech input.

7. A method as claimed in claim 6, characterised in that the
outcome of the evaluation process is used to determine the
database with which the said dialogue is effected to obtain
the speech information data for a spoken response to the speech
input.

8. A method as claimed in any one of the preceding claims,
characterised in that the dialogue with a database and/or
between databases is effected using a database communication
language, such as SQL (Structured Query Language).

9. A method as claimed in any one of the preceding claims,
characterised in that said speech recognition and
interpretation includes the steps of extracting prosody
information from a speech input, and obtaining dialectal
information from said prosody information, said dialectal
information being used in the conversion of said speech
information data, obtained from said database, into a spoken
response, the spoken responses being in the same language and
dialect as the speech input.

10. A method as claimed in claim 9, characterised in that the
prosody information extracted from the speech input is the
fundamental tone curve of the speech input.

11. A method as claimed in claim 10, characterised by the
steps of determining the intonation pattern of the fundamental
tone curve of the input speech and thereby the maximum and
minimum values of the fundamental tone curve and their
respective positions; determining the intonation pattern of the
fundamental tone curve of a speech model and thereby the
maximum and minimum values of the fundamental tone curve and
their respective positions; comparing the intonation pattern
of the input speech with the intonation pattern of the speech
model to identify a time difference between the occurrence of
the maximum and minimum values of the fundamental tone curves
of the incoming speech in relation to the maximum and minimum
values of the fundamental tone curve of the speech model, the
identified time difference being indicative of dialectal
characteristics of the input speech.

12. A method as claimed in claim 11, characterised in that the
time difference is determined in relation to an intonation
pattern reference point.

13. A method as claimed in claim 12, characterised in that the
intonation pattern reference point, against which the time
difference is measured, is the point at which a consonant/vowel
limit occurs.

14. A method as claimed in any one of the claims 9 to 13, 
characterised by the step of obtaining information on sentence 
accents from said prosody information. 

15. A method as claimed in claim 14, characterised in that the
words in the speech model are checked lexically, in that the
phrases in the speech model are checked syntactically, in that
the words and phrases which are not linguistically possible are
excluded from the speech model, in that the orthography and
phonetic transcription of the words in the speech model are
checked, and in that the transcription information includes
lexically abstracted accent information, of type stressed
syllables, and information relating to the location of
secondary accents.

16. A method as claimed in claim 15, characterised in that the
accent information relates to tonal word accent I and accent
II.

17. A method as claimed in any one of claims 14 to 16,
characterised by the step of using said sentence accent
information in the interpretation of the input speech.

18. A voice responsive communication system which utilises a 
method as claimed in any one of the preceding claims for 
providing a spoken response to a speech input to the system. 

19. A speech-to-speech conversion system for providing, at the
output thereof, spoken responses to speech inputs in at least
two natural languages, including speech recognition means for
the speech inputs; interpretation means for interpreting the
content of the recognised speech inputs, and a database
containing speech information data for use in the formulation
of said spoken responses, characterised in that the speech
information data stored in the database is in the said at least
two natural languages, in that the speech recognition and
interpretation means are adapted to recognise and interpret
speech inputs in said at least two natural languages, and in
that the system further includes evaluation means for
evaluating the recognised speech inputs and determining the
language of the inputs, dialogue management means for effecting
a dialogue with the database to obtain said speech information
data in the language of a speech input, and text-to-speech
conversion means for converting the speech information data,
obtained from the database, into a spoken response.

20. A speech-to-speech conversion system as claimed in claim
19, characterised in that the system is adapted to receive
speech inputs in two, or more, natural languages and to
provide, at the output thereof, spoken responses in the
language of the respective speech inputs, and in that the
system includes, for each of the natural languages, speech
recognition means, the inputs of each of the speech recognition
means being connected to a common input for the system; speech
evaluation means for determining, in dependence on the output
of each of the speech recognition means, the language of a
speech input; a database containing speech information data
for use in the formulation of spoken responses in the language
of the database; dialogue management means for connection to
a respective speech recognition means, in dependence on the
language of the speech input, said management means being
adapted to interpret the content of recognised speech and, on
the basis of the interpretation, to access, and obtain speech
information data from, at least a respective one of the
databases; and text-to-speech conversion means for converting
the speech information data obtained by said management means
into spoken responses to respective speech inputs.

21. A speech-to-speech conversion system as claimed in claim
19, characterised in that the system includes separate
databases for each of said at least two languages.

22. A speech-to-speech conversion system as claimed in claim
21, characterised in that the system includes separate dialogue
management means for each of the databases, each dialogue
management means being adapted to effect a dialogue with, at
least, a respective one of the databases.

23. A speech-to-speech conversion system as claimed in claim
22, characterised in that each dialogue management means is
adapted to effect a dialogue with each of the databases.

24. A speech-to-speech conversion system as claimed in claim
23, characterised in that the system includes translation means
for translating the speech information data output of each of
the databases into the language(s) of the other databases.

25. A speech-to-speech conversion system as claimed in claim
24, characterised in that, in the event that at least part of
the required speech information data for a spoken response is
stored in a database in a different language to that which is
required for the spoken response, said information is obtained
from said database and translated by said translation means
into the required language for the spoken response, and in that
the translated speech information is used either alone, or in
combination with other speech information, by the dialogue
management means to provide an output for application to the
text-to-speech conversion means.

26. A speech-to-speech conversion system as claimed in claim
25, characterised in that the system is adapted to receive
speech inputs in two languages, in that the system includes,
for each of the two languages, a database, dialogue management
means and translation means, in that each of the dialogue
management means is adapted to communicate with each of the
databases, and in that the data output of each of the databases
is connected directly to one of the dialogue management means
and to the other of the management means via a translation
means.

27. A speech-to-speech conversion system as claimed in any one
of the claims 19 to 26, characterised in that the system
includes speech recognition and interpretation means for each
of the said at least two natural languages, the inputs to the
speech recognition and interpretation means being connected to
a common system input.

28. A speech-to-speech conversion system as claimed in claim 
21, characterised in that recognised parts, or sequences, of 
the speech input, resulting from said speech recognition and 
interpretation in the said at least two natural languages, are 
evaluated by the evaluation means to determine the language of 
the speech input.

29. A speech-to-speech conversion system as claimed in claim 
28, characterised in that the output of the evaluation means 
is used to select the database from which said speech 
information data will be obtained by said dialogue management 
means for the formulation of the spoken response to the speech 
input. 

30. A speech-to-speech conversion system as claimed in any one
of claims 19 to 29, characterised in that the dialogue with a
database, and/or between databases, is effected using a
database communication language, such as SQL (Structured Query
Language).

31. A speech-to-speech conversion system as claimed in any one 
of the preceding claims characterised in that said speech 
recognition and interpretation means include extraction means 
for extracting prosody information from the speech input, and 
means for obtaining dialectal information from said prosody 
information, said dialectal information being used by said 
text-to-speech conversion means in the conversion of said 
speech information data into the spoken response, the dialect 
of the spoken response being matched to that of the speech 
input.

32. A speech-to-speech conversion system as claimed in claim
31, characterised in that the prosody information extracted from
the speech input is the fundamental tone curve of the speech
input.

33. A speech-to-speech conversion system as claimed in claim
32, characterised in that the means for obtaining dialectal
information from said prosody information includes first
analysing means for determining the intonation pattern of the
fundamental tone of the input speech and thereby the maximum
and minimum values of the fundamental tone curve and their
respective positions; second analysing means for determining
the intonation pattern of the fundamental tone curve of the
speech model and thereby the maximum and minimum values of the
fundamental tone curve and their respective positions;
comparison means for comparing the intonation pattern of the
input speech with the intonation pattern of the speech model to
identify a time difference between the occurrence of the
maximum and minimum values of the fundamental tone curves of
the incoming speech in relation to the maximum and minimum
values of the fundamental tone curve of the speech model, the
identified time difference being indicative of dialectal
characteristics of the input speech.

34. A speech-to-speech conversion system as claimed in claim
33, characterised in that the time difference is determined in
relation to an intonation pattern reference point.

35. A speech-to-speech conversion system as claimed in claim
34, characterised in that the intonation pattern reference
point, against which the time difference is measured, is the
point at which a consonant/vowel limit occurs.

36. A speech-to-speech conversion system as claimed in any one
of the claims 31 to 35, characterised in that the system
further includes means for obtaining information on sentence
accents from said prosody information.

37. A speech-to-speech conversion system as claimed in claim
36, characterised in that the speech recognition means includes
checking means for lexically checking the words in the speech
model and for syntactically checking the phrases in the speech
model, the words and phrases which are not linguistically
possible being excluded from the speech model, in that the
checking means are adapted to check the orthography and
phonetic transcription of the words in the speech model, in
that the transcription information includes lexically
abstracted accent information, of type stressed syllables, and
information relating to the location of secondary accent.

38. A speech-to-speech conversion system as claimed in claim
37, characterised in that the accent information relates to
tonal word accent I and accent II.

39. A speech-to-speech conversion system as claimed in any one
of claims 36 to 38, characterised in that said sentence accent
information is used in the interpretation of the content of the
recognised input speech.

40. A speech-to-speech conversion system as claimed in any one
of the claims 31 to 39, characterised in that sentence stresses
are determined and used in the interpretation of the content
of the recognised input speech.

41. A voice responsive communication system including a
speech-to-speech conversion system as claimed in any one of the
claims 19 to 40.